Hepatitis C is a common infection in Egypt, especially genotype 4. The prognosis of hepatitis C and the risk of developing cirrhosis are related to the stage of fibrosis. Liver biopsy is the best indicator for identifying the extent of liver fibrosis, but it has many drawbacks; in particular, it is costly and susceptible to sampling error. Non-invasive methods for assessing liver fibrosis are an alternative for staging chronic liver diseases. The aim of this paper is to develop a simple multi-linear model to predict the level of risk for liver fibrosis based on standard laboratory tests. In the proposed model, liver fibrosis was assessed via the METAVIR score; patients were categorized as having mild (F0-F1), moderate (F2), or advanced (F3-F4) fibrosis. Statistical analysis was performed using MedCalc software. The relationship between serum markers and the presence of significant fibrosis was assessed.

The P-values and correlation coefficients revealed that age, AST, AFP, albumin, platelet count, glucose, postprandial glucose test, and BMI were significantly associated with fibrosis. Multi-linear regression analysis was performed to develop a model for predicting liver fibrosis scores from serum markers. Sensitivity and Receiver Operating Characteristic (ROC) curve analyses were performed to evaluate the proposed model. In the training set, the area under the ROC curve (AUROC) for differentiating mild fibrosis from the other stages was 0.78, with sensitivity 68.8% and specificity 75.2% at a cutoff point of <1.5; for differentiating advanced fibrosis from the other stages it was 0.82, with sensitivity 78.3% and specificity 74.3% at a cutoff point of >1.7. It is concluded that a multi-linear regression model can predict fibrosis stages in chronic hepatitis C with acceptable accuracy and could be used to reduce the need for liver biopsy.
INTRODUCTION

Hepatitis C is an infectious disease of the liver caused by the hepatitis C virus (HCV). HCV infection can cause chronic hepatitis, which can result in serious long-term consequences including cirrhosis, hepatocellular carcinoma (HCC), liver failure, and the need for transplantation [1]. According to the World Health Organization (WHO) in 2013, HCV infection is widespread throughout the world: every year, 3-4 million people are infected with the hepatitis C virus. About 150 million people are chronically infected and at risk of developing liver cirrhosis and/or liver cancer, and more than 350,000 people die every year from hepatitis C-related liver diseases. The prevalence of anti-HCV antibody varies between countries, with high reported rates in Egypt [2-4]. The prognosis of hepatitis C and the risk of developing cirrhosis are highly related to the stage of fibrosis. Liver biopsy is considered one of the most powerful tools for assessing histological activity and stage of disease. However, it is not a gold standard [5]. It carries potential risks due to its invasive nature, and the histological assessment may suffer from variability of results. Furthermore, it is costly and susceptible to sampling error [6].

Because of these limitations, alternative non-invasive tests to assess fibrosis are being developed [7]. Some non-invasive tests are based on indexes derived from serum markers [8-11], such as the FIB-4 score and the aspartate aminotransferase (AST)-to-platelet ratio index (APRI) [12, 13]. Others are based on imaging techniques, such as Transient Elastography (TE), which uses ultrasound and vibratory waves to estimate the extent of liver fibrosis [14-18].

The aim of this work is to propose a non-invasive mathematical model to predict the level of risk for liver fibrosis based on standard laboratory tests. Such a model would provide useful information to help reduce the use of liver biopsies in chronic hepatitis C (CHC) patients and to spare patients the pain associated with the biopsy procedure.

PATIENTS AND METHODS 
Patients 

A dataset of blood serum measurements for 1401 patients from the National Liver Institute in Cairo was investigated and analyzed. A further 173 patients were then enrolled in the study as the testing set. The data contain reported clinical information including, but not limited to, the following: age, gender, and body mass index (BMI); histological findings such as grade of fibrosis and activity; and laboratory tests such as albumin, total bilirubin, indirect bilirubin, alanine aminotransferase (ALT), aspartate aminotransferase (AST), alpha-fetoprotein (AFP), alkaline phosphatase (ALP), gamma-glutamyl transferase (GGT), International Normalized Ratio (INR), quantity of HCV_RNA, white blood cell (WBC) count, hemoglobin (Hb), platelet count, creatinine, serology findings, glucose, postprandial glucose test (PC%), and HDL-cholesterol. The data were statistically analyzed using MedCalc and Microsoft Excel.

Methods 

Liver histology was determined via the METAVIR score as assessed by local pathologists (including 13 centers around Egypt). The total histological activity index and fibrosis scores (F0-F4) were recorded. According to the METAVIR system, fibrosis was staged on a scale from F0 to F4 as follows: F0, no fibrosis; F1, portal fibrosis without septa; F2, few septa; F3, many septa without cirrhosis; and F4, cirrhosis. F0 and F1 were considered mild fibrosis, and F2, F3, and F4 significant fibrosis, whereas F3-F4 were considered advanced fibrosis [19].
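The grouping of METAVIR stages into the strata used in this study can be written as a small lookup. The following is only an illustrative sketch; the function names `fibrosis_stratum` and `is_significant` are hypothetical and do not come from the study's software.

```python
def fibrosis_stratum(metavir_stage: int) -> str:
    """Map a METAVIR fibrosis stage (0-4) to mild / moderate / advanced."""
    if metavir_stage not in (0, 1, 2, 3, 4):
        raise ValueError("METAVIR stage must be an integer from 0 to 4")
    if metavir_stage <= 1:       # F0-F1
        return "mild"
    if metavir_stage == 2:       # F2
        return "moderate"
    return "advanced"            # F3-F4


def is_significant(metavir_stage: int) -> bool:
    """Significant fibrosis is F2 or higher (moderate or advanced)."""
    return fibrosis_stratum(metavir_stage) != "mild"
```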

Statistical Analysis 

Statistical analysis was performed with MedCalc software, version 12.5 (MedCalc Software, 1993-2013). Three possible outcomes were considered for the primary end point: insignificant or mild fibrosis (F0-F1), moderate fibrosis (F2), and advanced fibrosis (F3-F4). The relationship between each variable and the presence of significant fibrosis was assessed. The Kruskal-Wallis test was used for continuous variables with non-normal distribution, and the Chi-square test for categorical variables. Pearson correlation coefficients between fibrosis and each variable were also computed.

To compute the Kruskal-Wallis test (H-test) statistic, all the samples are first combined, the combined values are then ordered from low to high, and finally the ordered values are replaced by ranks, starting with 1 for the smallest value. The statistic used for the Kruskal-Wallis test is designated H [20]. Its formula is:

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) \tag{1}$$

where $R_1, R_2, \ldots, R_k$ are the sums of the ranks of samples 1, 2, ..., k, respectively; $n_1, n_2, \ldots, n_k$ are the sizes of samples 1, 2, ..., k, respectively; n is the combined number of observations for all samples; and k is the number of populations.
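The ranking procedure and the H statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the MedCalc implementation: tied values are assumed to receive the mean of their ranks, and no tie correction is applied to H. The function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(samples):
    """H = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1), no tie correction."""
    combined = [v for sample in samples for v in sample]
    n = len(combined)
    ranks = average_ranks(combined)
    total = 0.0
    start = 0
    for sample in samples:
        rank_sum = sum(ranks[start:start + len(sample)])  # R_i
        total += rank_sum ** 2 / len(sample)              # R_i^2 / n_i
        start += len(sample)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

In practice the variable of interest (for example, platelet count) would be split into one sample per fibrosis stratum before calling `kruskal_wallis_h`.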

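The Chi-squared statistic is the sum, over every cell, of the squared difference between the observed and expected frequencies divided by the expected frequency:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \tag{2}$$

where $O_i$ and $E_i$ are the observed and expected frequencies of cell i.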

If the calculated P-value is less than 0.05, then there is a statistically significant relationship between the two 
classifications [21]. 
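As an illustration of the Chi-squared computation, the sketch below evaluates the statistic for a contingency table of counts. It assumes that the expected frequency of each cell is obtained from the row and column totals under independence; the function name and any counts passed to it are hypothetical and are not taken from the study data.

```python
def chi_squared_statistic(observed):
    """Return (chi2, degrees_of_freedom) for a table given as rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two classifications were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof
```

The P-value is then read from the Chi-squared distribution with the returned degrees of freedom.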

Correlation analysis is the study of the relationship between two variables. The correlation coefficient describes the strength of the linear relationship between variables X and Y. Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. The correlation coefficient r takes values from -1.00 to +1.00, where +1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. A correlation coefficient r close to 0 shows that the linear relationship is quite weak. A correlation coefficient r close to -1.00 or +1.00 indicates a very strong correlation between the two variables [20].
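The definition of Pearson's coefficient as covariance divided by the product of the standard deviations translates directly into code. The following is a minimal sketch with a hypothetical function name; the common factor 1/(n-1) cancels between numerator and denominator and is therefore omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```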

Multiple regression analysis has been performed using the training data set to develop models for prediction of 
fibrosis considering the following variables at baseline: age, gender, BMI, AST, ALT, AFP, ALP, indirect bilirubin, total 
bilirubin, HDL cholesterol, quantity of HCV_RNA, glucose, PC%, serology, total WBC count, platelet count, 
GGT, creatinine, Hb, INR, and albumin. 

Multiple linear regression is a method of analysis for assessing the strength of the relationship between each of a 
set of explanatory variables (sometimes known as independent variables), and a single response (or dependent) variable. 
When only a single explanatory variable is involved, we have what is generally referred to as simple linear regression. 
Applying multiple regression analysis to a set of data results in what are known as regression coefficients, one for each 
explanatory variable. 

The multiple regression model for a response variable, y, with observed values $y_1, y_2, \ldots, y_n$ (where n is the sample size) and q explanatory variables, $x_1, x_2, \ldots, x_q$, with observed values $x_{i1}, x_{i2}, \ldots, x_{iq}$ for i = 1, ..., n, is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_q x_{iq} + \varepsilon_i \tag{3}$$

The term $\varepsilon_i$ is the residual or error for individual i and represents the deviation of the observed value of the response for this individual from that expected by the model. These error terms are assumed to have a normal distribution with variance $\sigma^2$. The regression coefficients, $\beta_0, \beta_1, \ldots, \beta_q$, are generally estimated by least squares [22, 23].
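To make the least-squares estimation concrete, the sketch below fits the coefficients of a multiple linear regression by solving the normal equations with Gauss-Jordan elimination. It is a plain-Python illustration under the assumption of a small, well-conditioned problem, and is not the routine used by MedCalc; the function names `fit_least_squares` and `predict` are hypothetical.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bq] minimising the sum of squared residuals.

    rows: one list of q explanatory values per individual
    y:    observed responses, one per individual
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    p = len(X[0])
    n = len(X)
    # Augmented matrix of the normal equations (X^T X) b = X^T y
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coeffs, row):
    """Evaluate b0 + b1*x1 + ... + bq*xq for one individual."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], row))
```

For larger or ill-conditioned data sets a numerically more robust solver (for example, one based on QR decomposition) would normally be preferred over the normal equations.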

Sensitivity, specificity, and ROC analysis were performed to evaluate the proposed model. Then the fibrosis index 
derived from the training set was applied to the testing set to test its accuracy. 
ROC Curve and Optimal Threshold Point

Receiver Operating Characteristic (ROC) curve analysis [24, 25] is an effective method for assessing the performance or accuracy of a diagnostic test [26]. The area under the ROC curve (AUROC) is a value that varies from 0.5, which represents a worthless test, to 1, which represents a perfect test.

When you consider the results of a particular test in two populations, one population with a disease, the other population without the disease, you will rarely observe a perfect separation between the two groups. Indeed, the distributions of the test results will overlap, as shown in figure 1.

Figure 1: The Overlap of Test Results for Two Populations [MedCalc]

For every possible threshold point or criterion value you select to discriminate between the two populations, some cases with the disease will be correctly classified as positive (TP = True Positive fraction), but some cases with the disease will be classified as negative (FN = False Negative fraction). On the other hand, some cases without the disease will be correctly classified as negative (TN = True Negative fraction), but some cases without the disease will be classified as positive (FP = False Positive fraction) [21]. The sensitivity is the probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage), and it is given by equation 4. The specificity is the probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage), and it is given by equation 5 [26].

$$S_N = \frac{TP}{TP + FN} \tag{4}$$

where $S_N$, TP, and FN are the sensitivity, true positive fraction, and false negative fraction, respectively.

$$S_P = \frac{TN}{FP + TN} \tag{5}$$

where $S_P$, FP, and TN are the specificity, false positive fraction, and true negative fraction, respectively.
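Sensitivity and specificity at a given threshold can be computed by counting the four outcome types. The sketch below assumes that a score strictly above the threshold is called positive; this convention, the function names, and any example values are assumptions made for illustration only.

```python
def confusion_counts(scores, diseased, threshold):
    """Count TP, FN, TN, FP when 'score > threshold' is called positive."""
    tp = fn = tn = fp = 0
    for score, has_disease in zip(scores, diseased):
        positive = score > threshold
        if has_disease and positive:
            tp += 1
        elif has_disease:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %) from the four counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    return sensitivity, specificity
```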

A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the true positive rate ($S_N$) against the false positive rate ($100 - S_P$) for different threshold points, as shown in figure 2. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993) [21].
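The construction of the ROC curve, and of the area under it, can be sketched as follows. The sketch assumes that higher scores indicate disease (a score at or above the threshold is called positive), works with fractions rather than percentages, and estimates the area with the trapezoidal rule; the function names are hypothetical and MedCalc may compute the area differently.

```python
def roc_points(scores, diseased):
    """Return (false positive rate, true positive rate) points as fractions."""
    n_pos = sum(1 for d in diseased if d)
    n_neg = len(diseased) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both diseased and non-diseased cases are required")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score is one threshold
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, diseased) if d and s >= t)
        fp = sum(1 for s, d in zip(scores, diseased) if not d and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```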

Sensitivity is inversely related to specificity in the sense that sensitivity increases as specificity decreases across the various thresholds.

Several criteria exist for finding the optimal threshold point, that is, the point that gives the maximum correct classification. Three criteria are used to find the optimal threshold point from the ROC curve: the Youden index, the closest-to-(0, 1) criterion, and the minimum-cost criterion. The first two methods give equal weight to sensitivity and specificity and impose no ethical, cost, or prevalence constraints. The third criterion considers cost, which mainly includes the financial cost of true and false diagnoses. This method is rarely used in the medical literature because it is difficult to estimate the respective costs. The Youden index is the more commonly used criterion because it reflects the intention to maximize the correct classification rate and is easy to calculate. In the proposed model, the Youden index has been used.

The Youden index criterion maximizes the vertical distance from the line of equality to the point [x, y] on the ROC curve, as shown in figure 2. The x-axis represents (100 - specificity) and the y-axis represents sensitivity.




Figure 2: ROC Curve and its Components [19] (x-axis: 100 - specificity; y-axis: sensitivity; $S_N$ = sensitivity, $S_P$ = specificity)

The main aim of the Youden index is to maximize the difference between the true positive rate TPR ($S_N$) and the false positive rate FPR ($1 - S_P$), using equation 6 [24]:

$$J = \max\left[S_N + S_P - 1\right] \tag{6}$$

where J, $S_N$, and $S_P$ are the Youden index, sensitivity, and specificity, respectively.
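The search for the threshold that maximizes the Youden index can be sketched as a scan over all observed scores. The sketch assumes that a score at or above the threshold is called positive and that sensitivity and specificity are expressed as fractions; ties in J are resolved in favour of the lowest threshold. The function name is hypothetical.

```python
def youden_optimal_threshold(scores, diseased):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    n_pos = sum(1 for d in diseased if d)
    n_neg = len(diseased) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both diseased and non-diseased cases are required")
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, d in zip(scores, diseased) if d and s >= t)
        tn = sum(1 for s, d in zip(scores, diseased) if not d and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # strict comparison keeps the lowest tied threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j
```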

RESULTS 

Patient Characteristics and Effective Markers Prediction 

The characteristics of the patients in the training dataset and the correlation and P-value results are shown in Table 1 and illustrated in figures 3 and 4. Data are expressed as mean ± SD unless otherwise stated. The correlation and P-value results, as shown in figures 3 and 4, identified age, AFP, AST, BMI, platelet count, albumin, glucose, and PC% as independent predictors of fibrosis, with the most statistically significant relationships (P-value < 0.01) and acceptable correlation (|r| > 0.1) with fibrosis.

Table 1: Characteristics of Patients in the Dataset

| Characteristics | Dataset | N | Correlation | P-Value |
|---|---|---|---|---|
| Age | 40.95 ± 10.58 | 1401 | 0.3066 | P<0.000001 |
| Gender | Female (28%), Male (72%) | 1401 | | 0.7894 |
| BMI | 27 ± 3.6 | 1401 | 0.1341 | 0.000017 |
| AFP | 6.9 ± 0.4 | 1401 | 0.1499 | P<0.000001 |
| ALP | 94.76 ± 56.26 | 1358 | 0.04292 | 0.550191 |
| AST | 51.53 ± 30.1 | 1401 | 0.2035 | P<0.000001 |
| ALT | 61.6 ± 38.26 | 1401 | 0.0835 | 0.174605 |
| Platelet | 217.6 ± 64.5 | 1401 | -0.2491 | P<0.000001 |
| Albumin | 4.3 ± 0.41 | 1401 | -0.1185 | 0.000065 |
| Indirect Bilirubin | 0.55 ± 0.53 | 823 | -0.002061 | 0.692066 |
| Total Bilirubin | 0.77 ± 0.39 | 1377 | 0.08275 | 0.000053 |
| Glucose | 97.83 ± 22.2 | 1376 | 0.1118 | 0.000797 |
| PC% | 88.51 ± 12.04 | 1363 | -0.1109 | 0.000047 |
| Creatinine | 0.86 ± 0.19 | 1339 | -0.0358 | 0.123244 |
| Hemoglobin (Hb) | 13.98 ± 1.68 | 1399 | 0.03658 | 0.024753 |
| WBC | 6.38 ± 1.93 | 1400 | -0.04402 | 0.737154 |
| HDL-Cholesterol | 47.18 ± 22.64 | 17 | 0.01729 | 0.661045 |
| GGT | 65.1 ± 55.6 | 17 | 0.4964 | 0.489583 |
| HCV_RNA_Quantitive | 4164818.41 ± 51473599.15 | 1372 | 0.009354 | 0.149538 |
| INR | 1.09 ± 0.12 | 638 | 0.116 | 0.00964 |
| Serology | Positive 380 (36.68%), Negative 656 (63.32%) | 1036 | 0.08805 | 0.0344 |

N = number of patients for whom the test was recorded.



A total of 790 patients had no recorded glucose or PC% values and were therefore excluded, leaving 612 patients with complete data in the training set. The distribution of fibrosis stages and the three strata among the training and testing sets is shown in table 2.

Figure 3: Bar Chart of P-Values Using the Kruskal-Wallis Test for the Data Set Variables

Figure 4: Bar Chart of Absolute Correlation Values between Each Variable in the Data Set and Fibrosis Scores



The Proposed Model 

The multiple linear regression was estimated using MedCalc software with respect to the significant variables, which are age, AST, AFP, BMI, platelet count, albumin, glucose, and postprandial glucose test. Based on the multiple linear regression, the following model has been derived:

Fibrosis = 0.2334 + 0.0247 × Age (yr) + 0.0031 × AST (IU/L) + 0.0028 × AFP (IU/L) - 0.0007 × BMI [sign missing in source] 0.0559 × Albumin (g/dL) - 0.0027 × Platelets - 0.0006 × Glucose - 0.0001 × PC% + 0.4631 × INR

where Fibrosis is the fibrosis stratum of the patient.



Table 2: Fibrosis Records and Strata in the Dataset

| Fibrosis Stage | Training Dataset (n = 612) | Testing Dataset (n = 173) |
|---|---|---|
| 0 | 12 | 0 |
| 1 | 386 | 120 |
| 2 | 108 | 35 |
| 3 | 104 | 18 |
| 4 | 2 | 0 |
| Total | 612 | 173 |
| Fibrosis Strata | | |
| Mild (0-1) | 398 | 120 |
| Moderate (2) | 108 | 35 |
| Advanced (3-4) | 106 | 18 |


The ROC Analysis 

Receiver operating characteristic curve plots for the proposed mathematical model are shown in figure 5. Figure 5(a) presents the ROC plot for differentiating mild fibrosis (F0-F1) from moderate to advanced fibrosis (F2-F4). Figure 5(b) shows the ROC plot for differentiating advanced fibrosis (F3-F4) from mild to moderate fibrosis (F0-F2). The area under the curve (AUC) for differentiating mild fibrosis from the others was 0.7765, and for differentiating advanced fibrosis from the others it was 0.8248. Table 3 gives the ROC analysis, sensitivity, and specificity results.

Table 3: ROC Analysis of Multiple Linear Regression Model Output in the Training Set

| | AUROC | Criteria Point | Sensitivity % | Specificity % |
|---|---|---|---|---|
| F01 vs. F24 | 0.7765 | < 1.5449 | 68.8 | 75.2 |
| F34 vs. F02 | 0.8248 | > 1.6823 | 78.3 | 74.3 |



Figure 5: ROC Curve of the Proposed Model (x-axis: 100 - specificity): (a) ROC Plot for Differentiating F01 from F24; (b) ROC Plot for Differentiating F34 from F02

Validation 

The fibrosis index derived from the training set was applied to the testing set to test its accuracy. The FIB-4 index was then applied to the same testing set, and its ROC analysis was compared with that of the proposed model for further confirmation. The ROC analyses of the proposed model and FIB-4 in the testing set are summarized in table 4 and plotted in figure 6. Figure 6(a) shows the ROC for differentiating F01 from F24 using the proposed model, with AUROC 0.66. Figure 6(b) shows the ROC for differentiating F34 from F02 using the proposed model, with AUROC 0.72. Figure 6(c) shows the ROC for differentiating F01 from F24 using the FIB-4 index, with AUROC 0.68. Figure 6(d) shows the ROC for differentiating F34 from F02 using the FIB-4 index, with AUROC 0.72.
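For reference, the FIB-4 index used in this comparison is commonly defined as age multiplied by AST, divided by the product of the platelet count and the square root of ALT. The text does not restate the formula, so the sketch below relies on that commonly published definition and on the assumption that AST and ALT are in U/L and platelets in 10^9/L; the function name is hypothetical.

```python
import math


def fib4(age_years, ast_u_l, alt_u_l, platelets_1e9_l):
    """FIB-4 = (age * AST) / (platelets * sqrt(ALT))."""
    if alt_u_l <= 0 or platelets_1e9_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))
```

Each patient's FIB-4 value can then be compared with a cutoff point in the same way as the output of the proposed model.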



Table 4: ROC Analysis of Proposed Model and FIB-4 Index in the Testing Set

| Testing Set | Comparison | AUROC | Criteria Point | Sensitivity % | Specificity % |
|---|---|---|---|---|---|
| Proposed model | F01 vs. F24 | 0.6579 | < 1.5 | 65.8 | 50.9 |
| Proposed model | F34 vs. F02 | 0.7219 | > 1.5 | 63.8 | 63.2 |
| FIB-4 | F01 vs. F24 | 0.6832 | < 0.7132 | 50.8 | 81.1 |
| FIB-4 | F34 vs. F02 | 0.7154 | > 0.9412 | 77.8 | 64.5 |




Figure 6: ROC Analyses for Testing Set: (a) ROC of Differentiating F01 from F24 Using the Proposed Model; (b) ROC of Differentiating F34 from F02 Using the Proposed Model; (c) ROC of Differentiating F01 from F24 Using FIB-4 Index; (d) ROC of Differentiating F34 from F02 Using FIB-4 Index
CONCLUSIONS

In this study, age, BMI, AST, AFP, platelet count, albumin, glucose, PC%, and INR were found to be significantly correlated (P < 0.01) with fibrosis stage. A simple multi-linear regression model for predicting fibrosis scores has been developed and tested. In the training set, the AUC for differentiating mild fibrosis from the others was 0.78, and for differentiating advanced fibrosis from the others it was 0.82. In the testing set, the proposed model gave a higher AUROC for prediction of F34 (0.7219) than FIB-4 (0.7154) and a lower AUROC for prediction of F01 (0.6579) than FIB-4 (0.6832), although the differences between them were small. It can be concluded that the proposed model can differentiate mild fibrosis from moderate to advanced fibrosis, and advanced fibrosis from mild to moderate fibrosis, with acceptable accuracy. The developed approach could be used as a powerful, safe, and low-cost alternative for predicting fibrosis strata in place of relatively risky tools (such as liver biopsy) in Egyptian patients with chronic hepatitis C. More work on improving the accuracy of the prediction will be done in the future using different machine-learning models.